Stable biomarker discovery in multi-omics data via canonical correlation analysis

Multi-omics analysis offers a promising avenue to a better understanding of complex biological phenomena. In particular, untangling the pathophysiology of multifactorial health conditions such as the inflammatory bowel disease (IBD) could benefit from simultaneous consideration of several omics levels. However, taking full advantage of multi-omics data requires the adoption of suitable new tools. Multi-view learning, a machine learning technique that natively joins together heterogeneous data, is a natural source for such methods. Here we present a new approach to variable selection in unsupervised multi-view learning by applying stability selection to canonical correlation analysis (CCA). We apply our method, StabilityCCA, to simulated and real multi-omics data, and demonstrate its ability to find relevant variables and improve the stability of variable selection. In a case study on an IBD microbiome data set, we link together metagenomics and metabolomics, revealing a connection between their joint structure and the disease, and identifying potential biomarkers. Our results showcase the usefulness of multi-view learning in multi-omics analysis and demonstrate StabilityCCA as a powerful tool for biomarker discovery.


Introduction
Multi-omics analyses have emerged as the next generation of high-throughput methods.By simultaneously considering multiple levels of the biological system under study, multi-omics offers a more comprehensive view of it, and a better chance of understanding the underlying processes [1][2][3].
However, along with promise come challenges: issues typical of biological data sets such as small sample sizes combined with a large number of variables are further compounded in multi-omics data [4].Heterogeneous data types and large discrepancies in the number of variables between different omics are also inherent problems [5,6].Taking full advantage of multi-omics necessitates developing and adopting new analytical methods [7,8].
The large number of variables encountered in biological data necessitates feature selection, and the primary goal of omics studies is often precisely the identification of a core set of variables of interest: biomarker discovery.This can be achieved via regularisation of models and imposing sparsity.However, picking the correct level of sparsity is a non-trivial task.In addition, there exists a known trade-off between sparsity and stability: how sensitive is the choice of variables to changes in the data [9,10].
Ideally, multi-omics analyses will go beyond simply concatenating different data types, preserving their complementary nature.Multi-view learning, the umbrella term for machine learning methods that consider the integration of multiple feature sets, is well-suited to accomplish this task [3,11].
One potential solution to the sparsity-stability trade-off is offered by the stability selection framework [12,13]: by simulating data perturbation with a subsampling procedure, more stable feature sets can be found.Here we show how stability selection can be efficiently applied in a multi-view context.
Our approach, named StabilityCCA, combines canonical correlation analysis (CCA) with stability selection to find subsets of connected variables in a multi-omics data set.CCA is an unsupervised multi-view method that finds a pair of maximally correlated projections for two sets of variables [14].By applying the stability selection framework to CCA, we offer an alternative approach to variable selection in high-dimensional multi-view analysis.Instead of setting a fixed level of model sparsity by optimising regularisation parameters, StabilityCCA constructs a multitude of models with varying levels of sparsity, giving a more holistic view of variable importance.
We tested StabilityCCA with simulated data and two different types of multi-omics from an inflammatory bowel disease (IBD) gut microbiome data set.The results show improvements in variable selection performance, in particular selection stability.Importantly, we saw significant improvements in stability in real data.We further demonstrate the method by analysing the canonical correlations found in the IBD data, uncovering a connection between the metagenome-metabolome structure and IBD status.With StabilityCCA, we were able to identify the important variables behind this connection, many of which are known IBD biomarkers.
This article is organised as follows.In a theory section, we introduce CCA, sparse CCA and stability selection, and show how stability selection can be applied to sparse CCA, leading to StabilityCCA.In a results section, we show results for two types of simulated data and a real multi-omics data set, comparing two different sparse CCA methods to their StabilityCCA extensions.The article ends with a discussion of the results.

Canonical correlation analysis
Canonical correlation analysis is an unsupervised dimensionality reduction method that can be applied when the variables under study naturally group into two disjoint sets.Suppose we have two corresponding sets of observations X 2 R n�p x and Y 2 R n�p y where n stands for the number of observations, and p x and p y for the number of variables in X and Y views respectively.We assume that the columns of X and Y have been standardised to have mean zero.CCA seeks projections X a and Y b such that the correlation between them is maximised: where h�, �i denotes the dot product.The solution to Eq (1) a 2 R p x and b 2 R p y are the first pair of canonical coefficients, and the projections u ¼ Xa 2 R n and v ¼ Yb 2 R n the first pair of canonical variables or scores.Further pairs can then be found by requiring that they are uncorrelated with the previous ones, up to min({rank(X), rank(Y)}) pairs.
The CCA problem can be solved analytically, for example using the singular value decomposition [15].However, when the number of variables exceeds that of observations, this solution is no longer unique.A classic remedy is to introduce some form of regularisation, typically inducing sparsity in the coefficient vectors a and b, and maximise the CCA objective via optimisation [16][17][18][19], resulting in sparse CCA (SCCA).
Rather than adding a penalty term to the objective, as is commonly done in regularised regression methods, many SCCA approaches instead formulate the regularisation into a set of constraints on the coefficient vectors a and b.An optimal solution can then be found via alternating optimisation in which the objective is iteratively improved from one side at a time, imposing the regularisation constraints at each step.Such an approach has been previously employed by [18], in a closely related formulation by [20], and in the context of kernel CCA by [19].
We will adopt and test two different SCCA approaches to be used as the base procedure with StabilityCCA.The first one, due to Witten et al., is based on a penalised matrix decomposition, and makes the simplifying assumption that the covariance matrices X T X and Y T Y can be treated as diagonal, effectively maximising the numerator in Eq (1).A combination of L 1and L 2 -norm constraints is used to induce sparsity, giving the following objective: where c x and c y are regularisation parameters for a and b respectively.This can be optimised by an alternating algorithm where at each step one of the coefficient vectors is kept fixed and Eq 2 maximised which can be done exactly (for full details of the algorithm, see [18]).We will refer to this method as penalised matrix decomposition CCA (PMD-CCA).PMD-CCA has several attractive qualities for our purposes.Firstly, when 1 < c x < ffi ffi ffi ffi p x p , the L 1 and L 2 constraints interact, setting some coefficients in a to zero, giving us the feature that the coefficient vector will be maximally sparse at c x � 1, and no sparsity is imposed at c x � ffi ffi ffi ffi p x p .This will prove very convenient for StabilityCCA.Secondly, the optimisation algorithm has low time complexity and appears to converge fast when a good initialisation is given (as proposed by [18], first singular vectors of the cross-covariance matrix appear to be a good choice in practice).This is important since the stability selection framework will necessarily add some further computational burden.
Our second SCCA method is essentially a combination of the approach used by [19] and PMD-CCA.That is, we retain the L 1 -L 2 constraints but solve for the full CCA objective: Optimisation is done using an alternating projected gradient algorithm (for details, see [19]).We will refer to this method as SCCA-EC for "SCCA with Elastic-net Constraints".
The SCCA-EC method has the same convenient qualities regarding the regularisation and low time complexity.However, contrary to PMD-CCA it does not ignore the within view covariances.It is not exactly clear when this assumption is appropriate and so it is interesting to see if and when it makes a difference.Furthermore, previous results suggest that different SCCA methods might be better suited for different types of data [21], so it makes sense to try different approaches with the StabilityCCA framework.

Stability selection
One potential solution to the sparsity-stability trade-off is offered by the stability selection framework, described in detail in [12,13].The idea is to simulate data perturbation by repeatedly subsampling the data.For each subsample, a regularisation path is calculated, forming a sequence of models with varying levels of sparsity.By recording the frequency with which a variable was selected at a given level of sparsity, we can estimate the probability of each variable to be selected as a function of regularisation.
Here we show how to apply stability selection to CCA.As introduced in the previous Section, we will test two different SCCA approaches as base procedures but the framework can in principle be applied with any SCCA method.
Since we have two sparsity parameters, we sample the parameter space using pairs of parameter values (c xi , c yi ).To capture the full range of regularisation, we run the parameters from (1, 1) to ð ffi ffi ffi ffi p x p ; ffi ffi ffi ffi p y p Þ, populating the interval using a logarithmic sequence.This will sample the lower end of the parameter space more densely, effectively giving it more weight, since we are more interested in relatively sparse models where only a few variables are being selected.The procedure, StabilityCCA, is described in pseudocode in Algorithm 1 and illustrated schematically in Fig 1 which shows a visual representation of the output: a stability path.It shows for each variable its selection probability, as estimated by StabilityCCA, as a function of the sparsity parameters.
We define the stability score of a variable as the area-under-curve of its stability path.In other words, it is the average selection probability across all (c x , c y ) values sampled.It can be seen as a measure of both relative and absolute variable importance: variables with higher stability scores are selected more often and thus more likely to be true signal variables.Values close to one indicate that a variable is selected almost always, even in very sparse models, and is likely to be very important for the model.
We tested two variable selection strategies based on the stability score, both using a single hyperparameter.In "Top-k" selection, variables are ranked and the top-k variables are selected, k being the hyperparamater.In threshold selection (Thr), variables with a stability score above a threshold τ will be selected, and τ is the hyperparameter.With both strategies, a CCA model can then be formed by running the base procedure with only the selected variables and regularisation parameters set so that no further sparsity is induced.

Simulated data
Data.We generated simulated data with a known ground truth using two different data models.The first we will refer to as "single latent data".Let a true and b true be ground truth coefficient vectors for the first (X) and second (Y) view respectively, and p true x and p true y the number of true variables.We randomly select p true x indices A true in a true to be the true variables (same for b true ).The entries A true i are then drawn from the uniform distribution U ½0;1� if j 2 a true , and set to zero otherwise (same for b true ).A latent factor Z i � N ð0; 1Þ is drawn for each sample i and the data is generated with X ij � N ðZ i a true j ; 0:5Þ and Y ij � N ðZ i b true j ; 0:5Þ.In the second data model, "multivariate data", a true and b true are generated in the same manner.Data is then drawn from a multivariate Gaussian distribution N ð0; SÞ where and equivalently for S YY and S YX .
Let now s x be the true sparsity of the X-view, that is, s x ¼ p true x =p x (and equivalently for Y).For all simulations, we set s x = 2s y , p true x ¼ 10 and p true y ¼ 15, and run simulations for s x values 1/2, 1/4 and 1/10.We set n = 50 in all cases.This set-up is meant to capture to varying degrees of severity many of the potential challenges encountered in multi-omics data: the curse of dimensionality, and asymmetry in the number of true variables, the level of true sparsity and number of variables between the two views.
Metrics.We evaluated the methods' ability to identify true variables using two metrics.The area under the receiver operating characteristic curve (AUC) was calculated using the absolute values of canonical coefficients as the score for SCCA-EC and PMD-CCA, and the stability scores for their StabilityCCA equivalents.Balanced accuracy (BA) was calculated for binarised canonical coefficients or for the selections performed with the Top-k and the Thr procedure for StabilityCCA.
The stability of selected variable sets was evaluated using the stability estimator introduced by [22].The stability estimator is a value between 1 and À 1 MÀ 1 where M is the number of variable sets.The maximum value 1 is achieved only when all selected sets are exactly equal and the minimum when their intersection is empty.The expected value of a random selection where each subset is equally likely to be drawn is zero.
Finally, model performance was evaluated by applying the canonical coefficients to a test set of size 100 and calculating the resulting correlation.For each scenario, we generated ten different random ground truths a true ,b true , and ten different training and test set pairs for each a true ,b true .Results are reported as averages over these 100 training/test sets for AUC, BA and test set correlation.The stability estimator was calculated over each of the sets of ten training sets with the same ground truth and we report the average over these.
Hyperparameter tuning.All hyperparameter tuning was done within the training set using ten rounds of 3-fold cross-validation and test set canonical correlation as metric of success.For the base procedures alone, SCCA-EC and PMD-CCA, a grid search over evenly distributed values of c x and c y in the range ½1; ffi ffi ffi ffi p x p � and ½1; ffi ffi ffi ffi p y p � was performed with grid size 15.For the Thr selection procedure, 100 values evenly distributed between the highest and lowest stability score were tested.For Top-k selection, since the number of potential k values can be very large, a binary search strategy was employed between 1 and p x + p y .Note that in practice, one would likely aim for low values of k.However, we chose this agnostic tuning strategy so as to not artificially bias the model towards sparse models.
The results for single latent data are shown in Fig 2A .For SCCA-EC, we observe a substantial improvement from StabilityCCA across all three variable selection metrics in all scenarios with not much difference between the Top-k and Thr models.For test set correlation there is little difference with the base procedure performing slightly better in the sparsest case.
Similar, if not as pronounced improvements are found for PMD-CCA.Notably, PMD-CCA alone clearly outperforms SCCA-EC, and even its StabilityCCA extension in some cases.All in all, we observe clear improvements from StabilityCCA in variable selection, in particular stability, with not much difference in model performance, independent of the SCCA method used.
The multivariate data results, shown in Fig 2B, are more mixed.For SCCA-EC, while stability is still improved, AUC and BA are in some cases higher for the base procedure alone.It appears that SCCA-EC benefits from stability selection more in sparser scenarios.Test set correlation on the other hand is always higher for the base procedure model.
For PMD-CCA, there is still a clear improvement in AUC.The Top-k and Thr selections diverge somewhat: the Top-k model has much better BA in the sparse cases, and is the most stable in the densest case while dropping even below the base procedure for sparser cases.The Thr model has more even stability performance and outperforms Top-k for test set correlation.Interestingly, the results also differ from the single latent data in that SCCA-EC now outperforms PMD-CCA for AUC and test set correlation.In summary, in our simulated data, adding the stability selection framework to SCCA seems to improve variable selection and in particular selection stability with not much difference in model performance.It appears that PMD-CCA is more stable than SCCA-EC to begin with, and does not benefit as much from StabilityCCA.

Real data
We further evaluated StabilityCCA on two different omics combinations from an IBD data set.IBD is a group of gastrointestinal conditions, the most common forms being Crohn's disease (CD) and ulcerative colitis (UC).The data comprises fecal metagenomics and metabolomics of individuals with CD (88), UC (76) and healthy controls (56), and was published in [23].There are two alternative feature sets for metagenomics: species (201 variables) or enzymes (2113 variables), and 466 metabolite variables.
We downloaded abundance tables from the electronic supplementary materials to [23].All data was transformed using the generalised log transformation: glogðxÞ ¼ logðx þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 24,25].Variables which had zero variance were removed.Finally, each column (variable) was normalised to have mean zero and variance one.
Since here we obviously do not have a ground truth to compare against, only stability and model performance were evaluated.Non-overlapping training and test sets of size 100 were randomly drawn 25 times.Stability was measured across the selections in these 25 sets and test set correlation is reported as the average over them as before.Hyperparameter tuning was the same as in the simulation experiments.
The results for the species-metabolites case are shown in Fig 3A .For SCCA-EC, there is again a very notable improvement in stability from StabilityCCA, with the Top-k model being slightly more stable than Thr.There is little change in model performance.For PMD-CCA, the base procedure is now again much more stable than SCCA-EC.The Thr-extension underperforms, having worse stability than the base procedure alone while PMD-CCA-Top-k has the highest stability of all methods.However, for model performance, SCCA-EC outperforms across the board.In the enzymes-metabolites case in Fig 3B , SCCA-EC performance is very similar to the previous case.For PMD-CCA, the base procedure selection is no longer stable.While stability is again improved by the Top-k selection, both SCCA-EC extensions are now more stable.In terms of model performance, SCCA-EC is now clearly better for both the base procedure and the StabilityCCA extensions.
All in all, our experiments on real data show that StabilityCCA greatly improves variable selection stability with no difference or even small improvements in model performance.The Top-k model appears to be the better selection strategy.The results also suggest that of the two SCCA methods tested, SCCA-EC might be better suited for these types of multi-omics data.

Canonical correlations are linked to disease status
We further illustrate the use of StabilityCCA by taking a closer look at the IBD data set.Based on the results presented above, we choose SCCA-EC as the base procedure and Top-k as the variable selection method.However, as mentioned before, in practice we would likely not wish to attempt to optimise k over all possible values but rather choose a relatively low value based on practical considerations such as ease of interpretability.
Indeed, when we investigated how increasing k will affect model performance, we found that while the average test set correlation does keep increasing up to k = 150, very little is gained beyond k = 50, and most of the canonical correlation is already captured with just ten variables (see Fig 4A).Moreover, as we see in Fig 5A and 5B, the top-50 and top-10 models show qualitatively the same behaviour.
The plots show the canonical variables for the two views, species-composition of the metagenome and metabolomics.Each point corresponds to a sample and the IBD status of the individual has been highlighted.We see that there is a clustering happening, separating IBD samples from controls.To quantify this pattern, we calculated the receiver operating characteristic curves for disease status using the sum of the canonical variables, X a + Y a, as the score: the AUC was 0.88 (IBD v Control, 0.94 for CD v Control and 0.81 for UC v Control) for the  top-10 model and 0.86 for the top-50 model (0.93 for CD v Control and 0.79 for UC v Control).This suggests several things: the uncovered connections between the metagenome and the metabolome are strongly linked to disease status in the IBD data, and while this connection is potentially very high-dimensional, its core can be captured by the top variables highlighted by StabilityCCA.
Table 1 shows the top-10 variables and their stability scores.The highest ranked variable, species Subdoligranulum unspecified, has a score close to one, indicating that it is almost always selected.Another prominent variable is the metabolite urobilin.Fig 5C displays the canonical coefficients and the variables' pairwise contributions: the contribution from variable j of the X-view (species) and variable k from the Y-view (metabolites) is and it is the proportion of the total canonical correlation (Eq (1)) due to the correlation between j and k.Accordingly, the highest contributions come from Subdoligranulum unspecified, and highest pairwise contributions are between Subdoligranulum and urobilin.Many of the top-10 variables have already been previously connected to IBD.More specifically, previous studies have shown a connection between IBD and decreases in the abundance or levels of Subdoligranulum species [26][27][28][29], urobilin [30,31], hydrocinnamic acid [31], the genus Coprococcus [32][33][34][35][36], Ruminococcus bromii [27,37], and Oscillibacter [36,38].
A similar pattern is found when the enzymes feature set is used as the metagenomics view.Table 2 shows the top-10 variables and their stability scores.Many of the same metabolites are present.Of the new ones, lithocholate potentially inhibits inflammation [39,40] and alphahydroxybutyrate has been found to be elevated in IBD [41,42].The canonical coefficients and pairwise contributions are shown in Fig 6C .The highest contribution is due to lithocholate Given that the same patterns were observed with both the species and the enzymes representation of the metagenome, it would be interesting to relate the two to each other-after all,

Discussion
We have presented a novel approach to variable selection in multi-view data: StabilityCCA.The method combines stability selection with sparse CCA to assess variable importance and  select stable sets of variables.We tested the method with two different sparse CCA approaches, using both simulated and real data.The results show that by applying the stability selection framework, we can improve the ability to select true variables as well as the stability of the selection, regardless of the sparse CCA approach used.Importantly, in real multi-omics data, stability of selected feature sets was greatly improved by StabilityCCA.We applied StabilityCCA to an IBD gut microbiome data set.Suggested causes for IBD include interactions between genetic and environmental factors [43].In particular, the microbiome has been heavily implicated, with changes in microbiota composition and diversity being routinely observed.However, the exact causative mechanisms remain unknown [44,45].Since the metabolome acts as the interface mediating host and microbiota interaction, a metagenome-metabolome multi-omics analysis is a promising avenue to a better understanding of IBD pathology.
We found the canonical correlations between the gut metagenome and metabolome to be strongly linked to the disease.The connection was stronger for Crohn's disease than for ulcerative colitis, in line with previous analysis of the same data [23].With StabilityCCA, we were able to pinpoint a relatively small set of variables most responsible for this observed structure, many of which have already been linked to IBD.
While the exact aetiology of IBD remains unclear, the role of butyrate and other shortchain fatty acids (SCFAs) in reducing inflammation and maintaining the integrity of the epithelium barrier has been extensively investigated [46].Accordingly, at least two of the microbial species highlighted by our analysis, Subdoligranulum unspecified and Ruminococcus bromii, have been identified as SCFA producers [26,37].The hypothesis is that a reduced abundance of these beneficial bacteria lead to decreased levels of SCFAs, which in turn lead to inflammation and epithelium breakdown.
In a similar vein, lithocholate, the top variable in the ezymes-metabolites scenario, has been proposed to inhibit epithelial apoptosis and promote barrier function [39,40].Another metabolite within the enzymes-metabolites top-10, alpha-hydroxybutyrate, is a potential biomarker for autoimmune disease [42].Futhermore, it is part of propanoate metabolism: changes in propanoate and butanoate metabolism within the microbiome have been linked to IBD [45].
There are some important limitations to our analysis.As an unsupervised method, CCA, and by extension StabilityCCA, may not optimally fit settings that require supervised prediction.Also, the canonical correlations cannot be directly interpreted as causal dependencies, and hence further study is needed to provide a full picture of the underlying biological significance of the findings.
Additionally, we have focussed here on extending stability selection to CCA which is designed for two views only.There are several alternative ways to extend CCA to three or more views [47,48], and further research is needed to understand whether our approach could be used with these methods as well.
The fact that the connection between the microbiome and IBD arises "serendipitously" from the data-meaning that disease status was not part of the analysis-arguably lends further credence both to the hypothesis that IBD is connected to the microbiota and the theory that this influence is mediated by the metabolome.By revealing a joint structure of the metagenome and the metabolome, the model points to a good starting point for future study which will hopefully shine more light on the exact mechanisms behind IBD, demonstrating the promise in multi-view analysis.

Fig 1 .
Fig 1.Schematic representation of the StabilityCCA procedure.Above are shown regularisation paths: the SCCA model as a function of sparsity.Below, stability paths, derived from the regularisation paths for 100 subsamples of size n/2.The stability score of a variable is the area-under-curve (AUC) of its stability path.https://doi.org/10.1371/journal.pone.0309921.g001

Fig 4 .
Fig 4. Top-k model performance for different k values in the IBD data set.The constant lines correspond to the average performance when an optimal value k* was selected through hyperparameter tuning.https://doi.org/10.1371/journal.pone.0309921.g004

Fig 5 .
Fig 5. StabilityCCA models for species-metabolites IBD data.(A) Top-50 model canonical variables.The canonical correlation (CC) is shown above the plot.(B) Top-10 model canonical variables.(C) Canonical coefficients and pairwise contributions to the canonical correlation for the top-10 model.(*) match to a standard with isomeric forms that could not be differentiated.https://doi.org/10.1371/journal.pone.0309921.g005

Fig 6
shows the canonical variables for a top-100 and a top-10 model.Again using the sum of the variables as a score, the AUC is 0.90 (IBD v Control, 0.95 for CD v Control and 0.84 for UC v Control) for the top-100 model and 0.88 for the top-10 model (0.94 for CD v Control and 0.81 for UC v Control), showing that while in this case more top variables are needed to achieve the highest test correlations (see Fig 4B), the underlying biological pattern is already captured by just ten.

Fig 6 .
Fig 6.StabilityCCA models for enzymes-metabolites IBD data.(A) Top-100 model canonical variables.The canonical correlation (CC) is shown above the plot.(A) Top-10 model canonical variables.(C) Canonical coefficients and pairwise contributions to the canonical correlation for the top-10 model.(*) match to a standard with isomeric forms that could not be differentiated.https://doi.org/10.1371/journal.pone.0309921.g006

Fig 7 .
Fig 7. Canonical variables (A) and coefficients (B) for the species and enzymes views, taken from the top-10 species-metabolites model and the top-10 enzymes-metabolites model.https://doi.org/10.1371/journal.pone.0309921.g007 is a set of 100 subsamples derived by randomly splitting X and Y in half 50 times.

Table 1 . Top-10 variables from the species-metabolites data set.
Protein-disulfide reductase (glutathione)-the highest ranked variable and the highest ranked enzyme variable.